Table of Contents
  1. DNN Structure
  2. Forward Propagation
    1. Linear Forward
    2. Activated Forward
    3. Matrix Dimensions in FP
  3. Cost Function
  4. Backward Propagation
    1. Activated Backward
    2. Linear Backward
  5. Parameters Update
  6. Summary


It has been a long time since my last update. In my recent work, I wanted to understand intuitively how deep learning works rather than just using a deep learning framework. Therefore, I decided to program the deep learning code from the very beginning. After following the tutorial of Andrew Ng, I wrote a simple DNN for cat classification. The source code can be downloaded here.

DNN Structure

This is a fully connected DNN consisting of one input layer, one output layer, and multiple hidden layers. The input layer is the standardized image2vector, which is reshaped from the image \((x,y,channel)\) to \((x*y*channel,1)\). The output layer is a probability indicating whether the input image is a cat or not. The L-layer DNN structure is shown below.


DNN

In my case, the input image is 64*64 with RGB channels, so the image2vector has shape (12288,1). The activation function of the last layer is the sigmoid function; the remaining layers use the ReLU function.

nn_layers=[12288,20,7,5,3,1]
nn_activation=["relu","relu","relu","relu","sigmoid"]

There are 6 layers in this DNN. The input layer has 12288 input units, followed by 20 neurons in the 2nd layer, 7 neurons in the 3rd layer, and so forth. The output layer (the 6th layer) has one neuron and is activated by the sigmoid function.
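
For reference, here is a minimal sketch of the image2vector step described above; the variable names and the division by 255 for standardization are my assumptions, not the original code:

import numpy as np

image = np.random.randint(0, 256, size=(64, 64, 3))   # stand-in for one 64*64 RGB image
image_vector = image.reshape(-1, 1)                    # flatten to shape (12288, 1)
image_vector = image_vector / 255.0                    # assumed standardization to [0, 1]
print(image_vector.shape)                              # (12288, 1)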

Forward Propagation

Forward propagation contains two parts:

  • Linear Forward
  • Activated Forward

Linear Forward

Similar to logistic regression, linear forward can be calculated by:
$$
Z^{l}=W^{l}A^{l-1}+b^{l}
$$
where \(W^{l}\) is the weight matrix and \(b^{l}\) is the bias of layer \(l\), and \(A^{l-1}\) is the output of layer \(l-1\). The input image2vector \(X\) can be treated as \(A^{0}\). Here is the Python code:

import numpy as np

def linear_forward(A,W,b):
    # Z^l = W^l A^{l-1} + b^l (b is broadcast across the m columns of A)
    Z=np.dot(W,A)+b
    return Z
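
As a quick sanity check (the number of examples here, 5, is arbitrary; note how the bias \(b\) is broadcast across the columns of \(A\)):

A = np.random.randn(12288, 5)    # a fake A^0 for 5 examples
W = np.random.randn(20, 12288)
b = np.zeros((20, 1))
Z = linear_forward(A, W, b)
print(Z.shape)                   # (20, 5)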

Activated Forward

There are a variety of options for the activation function, such as ReLU, tanh, and sigmoid. In my case, I use ReLU in every layer except the output layer, where I use the sigmoid function. The following is the code:

def relu(Z):
    A = np.maximum(0,Z)
    assert(A.shape == Z.shape)
    return A

def sigmoid(Z):
    A=1/(1+np.exp(-Z))
    assert(A.shape==Z.shape)
    return A

Matrix Dimensions in FP

In order to accelerate training, vectorization is very common in deep learning; therefore, we need to figure out the dimensions of the matrices used in the network. Take a simple model as an example.


3layers

Assume that there are \(m\) images in the training set and the DNN structure is [12288,7,1]. We need to figure out the dimensions of \(A^{l}\), \(W^{l}\), \(b^{l}\), \(Z^{l}\) and \(\hat{Y}\).

First, the input layer depends on the size of the image. For example, if one image is 64*64 with RGB channels, it can be vectorized into a vector of shape \((12288,1)\). There are \(m\) training images; therefore the input matrix \(X=A^{0}\) has shape \((12288,m)\).


Input X

Similar to \(X=A^{0}\), \(A^{1}\) can be deduced to be a matrix of shape \((7,m)\) and \(\hat{Y}=A^{2}\) is \((1,m)\). The activated forward only maps \(Z^{l}\) to \(A^{l}\) element-wise; it does not change the shape, so the shape of \(Z^{l}\) is the same as the shape of \(A^{l}\).

\(W^{l}\) and \(b^{l}\) connect \(A^{l-1}\) to \(Z^{l}\). For example, to transform the \(A^{0}\) \((12288,m)\) matrix into the \(Z^{1}\) \((7,m)\) matrix, \(W^{1}\) must have shape \((7,12288)\) and \(b^{1}\) must have shape \((7,1)\).

To sum up, the shapes of the matrices in the fully connected network are:
$$
A^{l}.shape=(N^{l},m)\\
Z^{l}.shape=A^{l}.shape=(N^{l},m)\\
b^{l}.shape=(N^{l},1)\\
W^{l}.shape=(N^{l},N^{l-1})
$$
where \(N^{l}\) is the number of neurons in layer \(l\) and \(m\) is the number of training images. Based on this, the DNN parameters can be initialized once we know the DNN structure.

def init_param(nn_layers):
    np.random.seed(1)
    param={}

    for i in range(1,len(nn_layers)):
        # scale the random weights by 1/sqrt(N^{l-1}) so the activations do not blow up
        param["W"+str(i)]=np.random.randn(nn_layers[i],nn_layers[i-1])/np.sqrt(nn_layers[i-1])
        param["b"+str(i)]=np.zeros((nn_layers[i],1))

        assert(param["W"+str(i)].shape==(nn_layers[i],nn_layers[i-1]))
        assert(param["b"+str(i)].shape==(nn_layers[i],1))

    return param
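
For example, initializing the DNN structure defined at the beginning of this post gives parameter matrices with exactly the shapes derived above:

param = init_param([12288, 20, 7, 5, 3, 1])
print(param["W1"].shape)   # (20, 12288)
print(param["b1"].shape)   # (20, 1)
print(param["W5"].shape)   # (1, 3)
print(param["b5"].shape)   # (1, 1)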

Cost Function

The cost function helps us check whether the DNN model is actually learning. The value of the cost function should decrease at each iteration. In my case, I use the cross-entropy cost \(J\) to compare \(\hat{Y}\) with the ground truth \(Y\). There are \(m\) training images, so we take the average cost:
$$
J=-\frac{1}{m}\sum_{i=1}^{m}\left(Y^{(i)}\log(\hat{Y}^{(i)})+(1-Y^{(i)})\log(1-\hat{Y}^{(i)})\right)
$$

def compute_cost(A_cache,Y):
    L=len(A_cache)-1
    A=A_cache["A"+str(L)] # get Y hat = the last layer's activation A^L
    m=Y.shape[1]

    cost=(-1.0/m)*np.sum(Y*np.log(A)+(1-Y)*np.log(1-A)) # -1.0 keeps the division in floating point

    return cost
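
A tiny hand-checkable example (the A_cache dictionary with keys "A0", "A1", ... is the layout assumed by compute_cost above; the numbers are made up):

Y = np.array([[1, 0]])                        # ground truth labels for 2 examples
A_cache = {"A0": np.zeros((3, 2)),            # dummy input activation
           "A1": np.array([[0.9, 0.2]])}      # pretend network output Y hat
print(compute_cost(A_cache, Y))               # about 0.164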

Backward Propagation

Backward propagation is used to calculate the gradient of the cost function with respect to the parameters \(W\) and \(b\). First, we need the gradient with respect to \(\hat{Y}\), the output of the last layer, treated as \(A^{L}\):
$$
dA^{L}=\frac{\partial{J}}{\partial{A^{L}}}=
-(\frac{Y}{A^{L}}-\frac{1-Y}{1-A^{L}})
$$
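
This initial gradient can be computed directly from \(A^{L}\) and \(Y\); here is a one-line sketch (the function name init_backward is my own, not from the original code):

def init_backward(AL, Y):
    # dJ/dA^L for the cross-entropy cost above, element-wise over the m examples
    return -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
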
Next, according to the chain rule, the gradient can be calculated layer by layer. Backward propagation contains two parts:

  • Activated backward
  • Linear backward

Activated Backward

For layer \(l\), assume that \(dA^{l}\) is already known. If \(g(\cdot)\) is the activation function, \(dZ^{l}\) can be calculated by the element-wise product:
$$
dZ^{l}=dA^{l}*g'(Z^{l})
$$
For the ReLU and sigmoid functions, the derivatives can be calculated by:
$$
ReLU'(z)= \begin{cases}
1 & \text{ if } z>0 \\
0 & \text{ if } z\leq 0
\end{cases}
$$
$$
Sigmoid'(z)=Sigmoid(z)(1-Sigmoid(z))
$$
Below is the Python code in my case:

def partial_relu(Z):
    # derivative of ReLU: 1 where Z>0, 0 otherwise (avoid modifying the cached Z in place)
    pZ=(Z>0).astype(float)
    assert(pZ.shape == Z.shape)
    return pZ

def partial_sigmoid(Z):
    A=sigmoid(Z)
    pZ=A*(1-A)
    assert(pZ.shape == Z.shape)
    return pZ

def activate_backward(dA,Z,mode):
    if mode=="sigmoid":
        pZ=partial_sigmoid(Z)
        dZ=dA*pZ
    if mode=="relu":
        pZ=partial_relu(Z)
        dZ=dA*pZ

    return dZ

Linear Backward

For layer \(l\), assume that \(dZ^{l}\) is already known. Then \(dW^{l}\) and \(db^{l}\) can be calculated by:
$$
dW^{l}=\frac{1}{m}dZ^{l}(A^{l-1})^{T}\\
db^{l}=\frac{1}{m}\sum_{i=1}^{m}dZ^{l(i)}
$$

Here \(dZ^{l(i)}\) is the \(i\)-th column of \(dZ^{l}\). Since \(dZ^{l}\) stacks the gradients of all \(m\) training examples in its columns, we average over them, which is why we divide by \(m\): \(dW^{l}\) has the same shape as \(W^{l}\), namely \((N^{l},N^{l-1})\), and \(db^{l}\) has the same shape as \(b^{l}\), namely \((N^{l},1)\). It is also necessary to calculate \(dA^{l-1}\), because it is needed for the activated backward of layer \(l-1\):
$$
dA^{l-1}=(W^{l})^{T}dZ^{l}
$$
Here is the code for linear backward:

def linear_backward(dZ,A_pre,W):
    m=dZ.shape[1]
    dW=np.dot(dZ,A_pre.T)/m # same shape as W
    dB=np.sum(dZ,axis=1,keepdims=True)/m # sum over the m examples, same shape as b
    dA=np.dot(W.T,dZ) # dA^{l-1}, passed back to layer l-1

    assert(dA.shape==A_pre.shape)
    assert(dW.shape==W.shape)

    return dA,dW,dB

Parameters Update

At each iteration, the parameters are updated with the gradients calculated during backward propagation. In gradient descent, the learning rate is another hyperparameter that controls how fast the network learns. If the learning rate is too small, training will be very slow; if it is too large, the network may fail to converge to a good result.

def update_param(param,grads,learning_rate):
    L=len(param)//2 # get the number of layers

    for i in range(L):
        param["W"+str(i+1)]=param["W"+str(i+1)]-learning_rate*grads["dW"+str(i+1)]
        param["b"+str(i+1)]=param["b"+str(i+1)]-learning_rate*grads["dB"+str(i+1)]

    return param

The ‘param’ dictionary contains all the \(W,b\) used in forward propagation, and ‘grads’ contains all the \(dW,db,dA\) computed in backward propagation. In my case, the learning rate is 0.0075.

Summary

Here is the programming flow chart. For each layer \(l\), the input is \(A^{l-1}\) and the output is \(A^{l}\). The linear forward result \(Z^{l}\) is stored in the cache. During backward propagation, the input is \(dA^{l}\) and the output is \(dA^{l-1}\). \(dZ^{l}\) is easily calculated using \(Z^{l}\) from the cache; similarly, \(dW^{l}\) and \(db^{l}\) need \(A^{l-1}\) from the cache.


Flow Chart
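
To tie everything together, below is a minimal sketch of one possible training loop built from the functions above, following the flow chart; the cache layout (A_cache, Z_cache), the train function itself, and the iteration count are my own reconstruction, not the original code.

def train(X, Y, nn_layers, nn_activation, learning_rate=0.0075, iterations=2500):
    param = init_param(nn_layers)
    L = len(nn_layers) - 1                       # number of parameterized layers

    for it in range(iterations):
        # forward propagation, caching A^l and Z^l for the backward pass
        A_cache = {"A0": X}
        Z_cache = {}
        for l in range(1, L + 1):
            Z = linear_forward(A_cache["A" + str(l - 1)], param["W" + str(l)], param["b" + str(l)])
            Z_cache["Z" + str(l)] = Z
            A_cache["A" + str(l)] = relu(Z) if nn_activation[l - 1] == "relu" else sigmoid(Z)

        cost = compute_cost(A_cache, Y)

        # backward propagation, from dA^L down through every layer
        grads = {}
        AL = A_cache["A" + str(L)]
        dA = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
        for l in range(L, 0, -1):
            dZ = activate_backward(dA, Z_cache["Z" + str(l)], nn_activation[l - 1])
            dA, grads["dW" + str(l)], grads["dB" + str(l)] = linear_backward(
                dZ, A_cache["A" + str(l - 1)], param["W" + str(l)])

        # gradient descent update
        param = update_param(param, grads, learning_rate)

        if it % 100 == 0:
            print("iteration %d, cost %f" % (it, cost))

    return param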

OK, that is the process of how I programmed the fully connected DNN. The program code can be found on GitHub here.